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Abstract 

We describe an algorithm for dynamic load balancing of geometrically paral- 
lelized synchronous Monte Carlo simulations of physical models. This algorithm 
is designed for a (heterogeneous) multiprocessor system of the MIMD type with 
distributed memory. The algorithm is based on a dynamic partitioning of the 
domain of the algorithm, taking into account the actual processor resources of 
the various processors of the multiprocessor system. 



Keywords: Monte Carlo Simulation; Geometric Parallelization; Synchronous 
Algorithms; Dynamic Load Balancing; Dynamic Resizing 



IBM preprint 75.93.08, |hep-lat/9310"021 



*altevogt@dhdibnil.bitnet 



1 



1 Introduction 



During the last years, Monte Carlo simulations of scientific problems have turned out 
to be of outstanding importance |1|, ^ |], 0. It is now a general belief within the 
community of computational scientists that multiprocessor systems of the MIMD[]type 
with distributed memory are the most appropriate computer systems to provide the 
computational resources needed to solve the most demanding problems e.g. in High 
Energy Physics, Material Sciences or Meteorology. 

Implementing a synchronous parallel algorithm on a heterogeneous MIMD system with 
distributed memory (e.g. on a cluster of different workstations), a load balancing be- 
tween the processors of the system (taking into account the actual resources being 
available on each node) turns out to be crucial, because the processor with the least 
resources determines the speed of the complete algorithm. 

The heterogenity of the MIMD system may not only result because of heterogeneous 
hardware resources, but also due to a heterogeneous use of homogeneous hardware re- 
sources (e.g. on a workstation cluster, there may exist several serial tasks running on 
some of the workstations of the cluster for some time in addition to the parallel appli- 
cation; this results in a temporary heterogenity of the cluster, even if the workstations 
of the cluster are identical). This kind of heterogenity can in general only be detected 
during the runtime of the parallel algorithm. 

Therefore, the usual approach of geometric parallelization 0| to parallelize algo- 
rithms by a static decomposition of a domain into subdomains and associating each 
subdomain with a processor of the multiprocessor system is not appropriate for a het- 
erogeneous multiprocessor system. Instead, on a heterogeneous system this geometric 
parallelization should be done dynamic. 

In the sequel, we consider geometrically parallelized Monte Carlo simulations consisting 
of update algorithms (e.g. Metropolis or heatbath algorithms) defined on e.g. spins at 
the sites of a lattice (e.g. in Ising models), matrices defined on the links of the lattice 
(e.g. in Lattice Gauge Theories), etc.. In general, a synchronization of the parallel 
processes takes place after each sweep through the lattice (each iteration). 

For this class of simulations, we will introduce an algorithm for dynamic load balancing. 
Implementing and testing the algorithm for the two-dimensional Ising model, it will 
be shown, that this algorithm may drastically improve the performance of the Monte 
Carlo simulation on a heterogeneous multiprocessor system. 

The paper consist of basically three parts. In the first part we will introduce a sim- 
ple model for analyzing the performance of synchronous Monte Carlo simulations on 
multiprocessor systems with distributed memory, in the second part we will present 
our algorithm for the dynamic load balancing and finally we will present our numerical 

^Multiple Instruction stream, Multiple Data stream. For an excellent overview on the various 
models of computation like SISD, SIMD, etc., see 
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results. 



2 The Performance Model 

We consider a heterogeneou^ multiprocessor system consisting of n processors. For 
a parallelized Monte Carlo simulation we measure^ at times Imc the times At*^'^"^ the 
simulation has taken for a fixed number of sweeps on each of the processors^. The 
parallelization is done by geometric parallelization, associating a sublattice i with a 
characteristic scale L^'^"^ (e.g. a characteristic side of the sublattice, its volume, etc..) 
with each of the processors. These scales are choosen such that 



holds. Here L denotes the scale of the complete lattice and the c*"'^ are real numbers 
with < c-""^ < 1 and Y.i (^i^'"^ = iQ- Using these parameters, we can calculate 
quantities Pl^'"-^ , characterizing the computing resources of processor i at time Imc'- 



Jmc 
ytMC ._ 

tMC ^ ' 



Assuming, that the resources of the processors vary slowly compared with the time the 
simulation takes for one sweep 



i i 

we set 



(3) 



p. ^ ptMC ^ (4) 

Now we reinterpret formula (^: For fixed Pi we want to calculate a set of {c**^'^'*'^}, 
such that the time for the next sweep {excluding the time spent for communication)^ 



At({cj}) := max Ati(Q) with Atj(ci) := (5) 

Pi 

for i = 1 , . . . , n has a minimal value 
^In the sense of the introduction. 

■^Using e.g. a hbrary routine provided by the operating system. 

^Here the times Ati denote "wall clock times" measured in seconds and t mc denotes the "internal" 
time of the simulation, measured in numbers of Monte Carlo sweeps. 

^Thc initial choice of the parameters i*J>fc-0 respectively of the c\"'^~'^ is quite arbitrary, one 
could choose e.g. c]"^~''^ = - for all i. 

^From now on throughout this section we always consider the times At, Ati, etc. and the coefficients 
{ci} to be taken at iMC + 1- For the sake of clarity we therefore drop this index throughout this section. 



3 



Atmin ■= minAt{{ci}). (6) 

A necessary condition for this solution is obviously, that all Ati must be equal[} Re- 
membering the normalization condition on the constants Cj, we arrive at 

with i = 1, ...,n. Using (|[) this results in 

Atmin = ^ p ■ (8) 

For a homogenous system (all Pi equal) (0) would give 

c. = - (9) 

n 

For a heterogenous system this choice of {q} results in (using (|^)) 

At({c. = -}) = (10) 

As the homogenity H of the multiprocessor system we define therefore the ratio of 
Atmin with (Ip: 

H = n (11) 

with < < 1. The speedup S* that can be obtained by the dynamic resizing of the 
sublattices is the inverse of H: 



S^. (12) 
Rewriting (P) in terms of we arrive at 

1 H 

Atmin — : 77; (13) 

n mmj Fi 

For later comparison with our experimental data, let us look at the special case P : = 
Pi = ... = Pn-i > Pn ='■ Pmin- In this casc we have for Atmin'- 



''Let us assume that e.g. Ati > Ai2. Then we could easily make Aii smaller by a redefinition of 
ci and C2 with ci + C2 = const. 
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,n-H 



n 



(14) 



with At^ = Without load balancing, the time for the total simulation will be 

determined by the time spent on processor n: 



At. 



1 



nPn 



n-H 
{n - 1)H 



Strain • 

Figure ^ shows the "optimal" curve (using dynamic load balancing): 

At^ _ n-1 

At'fYiin ^ H 

and the one obtained without any load balancing: 

At^ n - 1 



(15) 



(16) 




for n = 4. These curves will be compared with our experimental results. 
1 
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Figure 1: Performance predicted by our model for 4 processors with and without load 
balancing. 



'Using P„ 



~ n-H ' 
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3 The Algorithm 

In this section we describe our algorithm to perform the dynamic load balancing, based 
on the performance model described above. 



3.1 The Input: 

- A characteristic scale L of the lattice (e.g. a side length of the lattice). 

- The number n of processors of the multiprocessor system. 

- The total number of iterations nuer to be done by the simulation. 

- The number of iterations Uresize after which a resizing of the sublattices may take 
place. 

- A control parameter e with < e < 1 to determine if a resizing should be done. 

3.2 The Output: 

- A dynamic resizing of the domains associated with each processor of the multi- 
processor system, taking into account the actual resources of the processors. 

3.3 Formal Steps: 

1. Read the input. 

2. Introduce characteristic scales L^^"^ of the sublattices with i = 1, .., n and Imc = 
1, ..,niter/nresize, whcrc i dcuotes the processors and tMc counts the number of 
resizings that have been done. 

3. Calculate the initial characteristic sizes of the sublattices for all processors 
according to lI"'^^^ = ^. 

4. Associate each of the sublattices with one of the processors. 

5. Do on each processor z = 1, .., n (in parallel) 

{ 

- Set tMc = 0. 

- For m = 1, ..,niter: 

{ 

- Perform iteration of the Monte Carlo update algorithm on the sublat- 
tice^. 

^Possibly including communication with other processors. 
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If (m mod riresize) = then 



{ 



Measure the wall-clock time At**^'^ spent on processor i for doing 
the calculations excluding the time spent for communications. 
Calculate 

J-tMC 

ptMC i n 8) 

to measure the actual resources of each node of the multiprocessor 
system. 

Communicate the results to all processors of the multiprocessor sys- 
tem. 

Calculate the new characteristic sizes 

Ll^'c+i Qf ^j^g sublattices with 
^^"'^" = yn' p,,,, L (. = l,..,n-l) (19) 

l—i'n=l n 



and 



n-l 



LtMc+i = L-Y. Li'''^'- (20) 



n=l 

Resize the sublattices if 

> eL. (21) 

This step may include the communication of parts of the sublattices 
between the processors and is certainly the critical part of the algo- 
rithm. We will introduce an algorithm for this resizing for a special 
case below. 

Set tMC = tMC + 1- 



} 

In the sequel we present an algorithm for resizing the sublattices for the special case that 
the splitting of the sublattices takes place only in one dimension. We use the host-node 
(respectively client- server) parallel programming paradigm (see |^), associating each 
sublattice with a server process and leaving the more administration oriented tasks 
(like reading the global parameters of the simulation, starting the server processes, 
etc.) to the host proces^. Let us assume the size of the lattice in the direction of 
the splitting to be L and that the host process holds arrays a[i\ with (z = 1, . . . , n -|- 1) 
for the "Monte Carlo times" tMc and tMc ~ 1 containing the first coordinate in that 
direction of the "slice" of the lattice associated with each processor: 

a[l] = 1 < a[2] < ... < a[n + 1] = L + 1. (22) 



^'^Of course these tasks could also be done by the server processes, resulting in the hostless paradigm 
of parallel programming. Therefore, our algorithm could also be implemented in a hostless model and 
our limitation to the host-node model is not a loss of generality. 
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(In terms of the constants {q} this would mean Cj = ^t±JJ_^/) Now the host process 
sends messages containing instructions to the node processes in two passes: 

1. For i = 2, . . . , n 

{ 

if ((rf := at^^^[{\ - at,,^_i[i]) > 0) 
{ 

send message to server i telhng it to send its "first" d shces to processor 
i - 1 

} 

} 

2. For i = n — l,n — 2,...,1 

{ 

if (d := at„^[2 + 1] - at^,^_i[i + 1]) < 
{ 

send message to server i telhng it to send its "last" d slices to processor 
i + l 

} 

} 

The node processes wait for messages from either the host process or from neighbour- 
ing node processes. If there are not enough slices available on a node process to be 
sent, the node process waits for a message from a neighbour node process to receive 
additional slices. The two pass algorithm prevents deadlocks. 

If the resources of the processors of the multiprocessor system change very rapidly, 
a multiple communication of data may be necessary and will drastically reduce the 
efficiency of this algorithm. But this is consistent with the fact, that our complete 
approach to dynamic load balancing is anyhow only valid for systems with moderatly 
varying resources, as was already pointed out at the beginning of section 2, see (^. 



4 Results for the Two— Dimensional Ising Model 

The above described algorithm has been implemented for the parallelized simulation 
of the two-dimensional Ising model on a cluster of four IBM RISC System/6000 - 
550 workstations , using the P VM programming environment |p, |T^ . Here we have 
a two-dimensional lattice which is divided into stripes. The objects defined on the 
lattice sites are spins (i.e. binary variables) and an iteration defined on these objects 
consists e.g. of a Metropolis algorithm to generate a new spin configuration on the 
lattice. Each stripe is associated with one workstation. The characteristic scales of the 
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stripes are their widths and the characteristic scale of the lattice is the sum of all widths. 



The cluster being completely homogeneous, the heterogeneous situation has been sim- 
ulated by starting independent processes on one or several nodes of the cluster. This 
allows the heterogenity of the multiprocessor system to be introduced in a controlled 
manner]^, i.e. to vary the homogenity H and measure (16) resp. (|l^) as functions of H . 
Our results are presented in figure ^ for a 1000 x 1000 and a 2000 x 2000 lattice. One 
clearly sees the qualitative agreement with the prediction of our performance model, 
see figure pp| . 
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Figure 2: Performance measured for 4 processors with and without load balancing. 



A different point of view consists of looking at the (mega) updates done by the Metropo- 
lis algorithm on each spin per second ( "MUps" )0. These are presented for a 1000 x 1000 
and a 2000 x 2000 lattice as a function of H in figures § and § with the dynamic load 
balancing being done after a certain number of sweeps. It turns out, that the optimal 
number of sweeps between the load balancing to be performed depends on the size of 

^^During the measurements cited below, the cluster has been dedicated to our application. 

^^Considering the fact, that we have not included the time spent for communication in our model, 
a quantitative agreement between the theoretical and measured performance cannot be expected. An 
inclusion of the communication in our model would be very difficult and highly system dependend, 
e.g. because system parameters like latency and bandwidth may be be complicated functions of the 
homogenity H. 

-'^^These "MUps" constitute a benchmark for spin models. 



9 




0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

Homogenity 



Figure 3: Performance measured in MUps for 4 processors for a 1000 x 1000 lattice 
with and without load balancing being done after a certain number of sweeps. 

the problem. 
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Figure 4: Performance measured in MUps for 4 processors for a 2000 x 2000 lattice 
with and without load balancing being done after a certain number of sweeps. 
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5 Summary 



We have introduced an algorithm for dynamic load balancing for synchronous Monte 
Carlo simulations on a heterogenous multiprocessor system with distributed memory. 
Implementing this algorithm for the two-dimensional Ising model, we have shown, that 
it may result in a speedup of a factor 5-6 for the above described class of geometrically 
parallelized algorithms. In many cases, the implementation of the algorithm is straight 
forward with only little overhead in calculation and communication. For homogeneous 
systems, almost no performance is lost because the algorithm detects that no resizing 
is necessary by applying (|2T|) . For systems with slowly changing heterogenity|^, the 



algorithm converges very fast and the requirements of the algorithm concerning the 
computing environment are minimal: the system only has to provide a routine to mea- 
sure the wall-clock time; such a routine should be available on all operating systems. 



Considering the generality of the algorithm introduced above, it may also be useful 
applied to problems other than Monte Carlo simulations, e.g. in parallel iterative 
methods for solving linear or nonlinear equations appearing in engineering problems^. 



^^compared to the time needed for one iteration (sweep) 

^^Here the domain consists of a lattice, with a matrix element being associated with each of the 
nodes of the lattice. 
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